ABSTRACT

This paper presents a method for training connectionist networks that
adhere to the principles of graded, random, adaptive, and interactive
propagation of information (GRAIN). While our analysis has been motivated
by our desire to find a learning algorithm that would work in this
environment, we have succeeded in implementing a model that encompasses
a large class of previous connectionist algorithms under the same
theoretical principles and that expands the scope of problems they can learn.
Simulations show examples where GRAIN networks successfully approximate
both discrete and continuous probability distributions, demonstrating
that their scope extends beyond what can be learned by backpropagation
networks or standard Boltzmann machines.

1  INTRODUCTION 

This paper presents a method for training connectionist networks that adhere
to the principles of graded, random, adaptive and interactive propagation of
information. These principles have emerged from issues in cognitive modeling
and have been consolidated as a research program named GRAIN (McClelland,
1990) committed to modelling normal and disordered cognition. Our analysis
has been motivated by our desire to develop a learning algorithm for the GRAIN
environment that would allow learning probability distributions.

Consider a network presented with a sample S of exemplars taken from a
joint probability distribution of input and teacher vectors

$$S = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s) \} \qquad (1)$$

where $x_i = (x_{i1} \ldots x_{iI})$ is an input vector, and $y_i = (y_{i1} \ldots y_{iO})$ the
corresponding teacher vector. We will refer to the components of a teacher
vector as teacher units.

One limitation of deterministic networks like standard backpropagation is
that they can only generate a unique output vector for each input rather than
a distribution of output vectors. This unique output vector is usually an
estimate of the low order statistics (e.g., expected values) of the distribution
of teachers, with information about the higher order statistics of this
distribution being omitted. Furthermore, the error functions minimized by most
learning algorithms are optimal when the teacher units are statistically
independent but are not so appropriate when interdependencies appear. For
instance, minimizing sum of squares, the error function typically used in
backpropagation, converges to the conditional arithmetic means of each teacher
unit and gives optimal results when the teachers are independent Gaussian
variables. In the same fashion, minimizing cross-entropy, an error measure
commonly used when the teachers are Boolean (0,1) variables, converges to the
conditional probabilities of each teacher unit and is the best when each
teacher unit is an independent Bernoulli random variable.

Omitting the higher order statistics of the distribution of teachers is
beneficial when the problem at hand is well modelled by deterministic
input-output functions with added independent noise components. On the other
hand, this higher order information is necessary in other important situations.
Consider for example a simple English to Spanish translation problem where the
English word "olive" has two equally likely translations, one of Latin root,
"oliva", and one of Arabic root, "aceituna". If we were to use distributed
representations for these words, backpropagation would learn the average of the
representations of the two acceptable alternatives, a blend which is not
acceptable. In cases like this we want to learn likelihoods of distributed
patterns of activation rather than expected values of individual units.
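The blending problem can be illustrated numerically. The sketch below assumes two made-up 4-unit codes for "oliva" and "aceituna" (they are not taken from the paper) and shows that the output minimizing summed squared error is the unit-wise mean of the two teachers, which is itself neither word.

```python
def mean_pattern(patterns):
    """Unit-wise arithmetic mean: the output that minimizes summed squared error."""
    count = len(patterns)
    width = len(patterns[0])
    return [sum(p[k] for p in patterns) / count for k in range(width)]


def sum_squared_error(output, patterns):
    """Summed squared error of one fixed output against every teacher pattern."""
    return sum((o - t) ** 2 for p in patterns for o, t in zip(output, p))


# Hypothetical distributed codes, chosen only for illustration.
OLIVA = [1.0, -1.0, 1.0, -1.0]
ACEITUNA = [1.0, 1.0, -1.0, -1.0]
TEACHERS = [OLIVA, ACEITUNA]

# The least-squares answer is a blend: units where the codes disagree go to 0.
BLEND = mean_pattern(TEACHERS)
```

With these codes the blend has a lower squared error than either legitimate translation, which is why a deterministic least-squares learner settles on it.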

If we want to learn arbitrary probability distributions, we need an error
function capable of capturing high order statistics, and we also need
randomness, one of the GRAIN principles. Boltzmann machines are intrinsically
random, and their learning algorithm minimizes an error function called
information gain (Ackley et al., 1985) that describes the extent to which the
distribution of patterns of activation produced by the network matches the
distribution of patterns in the environment. Unfortunately, since randomness
greatly reduces learning speed, this property of Boltzmann machines has often
been neglected, with more emphasis placed on speeding them up (e.g., using
annealing schedules or mean field approximations).

What we present here are the results of our efforts to develop a learning
algorithm that adheres to the GRAIN principles. We call the algorithm
contrastive Hebbian learning (CHL), a name inspired by the work of Galland and
Hinton (1989) with the deterministic Boltzmann machine. Contrastive Hebbian
learning is a generalization of the Boltzmann learning algorithm (Ackley et
al., 1985) for the continuous stochastic case and it includes a large variety
of previous connectionist learning algorithms under the same theoretical
principles. In this paper we present a theoretical framework for understanding
how the learning algorithm works, and a series of simulations where GRAIN
networks successfully approximate discrete and continuous probability
distributions of various types.


1.1  DETERMINISTIC  SKELETON 

Let $a = [a_1, \ldots, a_n]^T$ be a real valued activation column vector. Let
$W = [w_1, \ldots, w_n]$ be a real valued matrix of connections, where each
$w_i = (w_{1,i}, \ldots, w_{n,i})^T$ is the fan-in column vector of connections to
unit i. The default implementation of GRAIN uses the update equations of the
interactive activation and competition model (McClelland and Rumelhart, 1981):

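$$\frac{d a_i}{dt} = \lambda_i \left( (max_i - a_i)\, net_i - d_i\, (a_i - rest_i) \right) ; \quad net_i > rest_i \qquad (2)$$

$$\frac{d a_i}{dt} = \lambda_i \left( (a_i - min_i)\, net_i - d_i\, (a_i - rest_i) \right) ; \quad net_i < rest_i \qquad (3)$$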

where $net_i = a^T w_i$, $max_i$ is the maximum activation value for unit i, $rest_i$
the activation when the net input is zero, $min_i$ the minimum activation value,
$d_i$ a positive constant named decay which controls the sharpness of the
activation function, and $0 < \lambda_i < 1$ a constant which controls the speed of
processing. Any other continuous Hopfield update equation (Hopfield, 1984)
would also work.*

If  the  weight  matrix  is  symmetric  the  equations  implement  a  version  of 
the  continuous  Hopfield  model  (Hopfield,  1984)  with  an  associated  Goodness 
function  of  the  following  form: 


$$G = H - S \qquad (4)$$

where

$$H = \frac{1}{2} a^T W a \qquad (5)$$

is the harmony or consistency between the network activations and the weight
constraints. The stress

$$S = \sum_{i=1}^{n} d_i s_i \qquad (6)$$

is a weighted sum of penalty terms, $s_i$, for the activations departing from
rest value

$$s_i = (max_i - rest_i) \log \left( \frac{max_i - rest_i}{max_i - a_i} \right) - (a_i - rest_i) ; \quad a_i > rest_i \qquad (7)$$

$$s_i = (min_i - rest_i) \log \left( \frac{min_i - a_i}{min_i - rest_i} \right) + (a_i - rest_i) ; \quad a_i < rest_i \qquad (8)$$

*The present GRAIN implementation also has optional logistic and hyperbolic
tangent update functions as defined in (Hopfield, 1984).

The decay parameters, $d_i$, weight the relative importance of stress vs. harmony
for each unit, and correspond to the gain parameters if logistic equilibrium
activation functions are used (Hopfield, 1984).

It is easy to show (Movellan, 1990) that these equations implement a version
of the continuous Hopfield model with the activation vector asymptotically
equilibrating at a local maximum of G.
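A minimal sketch of the deterministic skeleton may help fix ideas. It assumes a synchronous Euler discretization of equations 2 and 3 with a unit time step, the same decay for every unit, and max = 1, min = -1, rest = 0 as in the simulations; the function names are illustrative. The below-rest branch of the stress is written in the form obtained by integrating the equilibrium condition of equation 3, which is an assumption about the damaged original. The two-unit network with a single positive symmetric weight is a toy case, not one of the paper's simulations.

```python
import math

MAX_ACT, MIN_ACT, REST = 1.0, -1.0, 0.0


def iac_step(a, w, decay, lam):
    """One synchronous Euler step of the IAC update (equations 2 and 3)."""
    n = len(a)
    new_a = []
    for i in range(n):
        net = sum(a[j] * w[j][i] for j in range(n))
        if net > REST:
            delta = (MAX_ACT - a[i]) * net - decay * (a[i] - REST)
        else:
            # net == REST is not covered by the equations; it is folded in here.
            delta = (a[i] - MIN_ACT) * net - decay * (a[i] - REST)
        new_a.append(a[i] + lam * delta)
    return new_a


def stress(a_i):
    """Penalty for one unit departing from rest (equations 7 and 8)."""
    if a_i >= REST:
        return ((MAX_ACT - REST) * math.log((MAX_ACT - REST) / (MAX_ACT - a_i))
                - (a_i - REST))
    return ((MIN_ACT - REST) * math.log((MIN_ACT - a_i) / (MIN_ACT - REST))
            + (a_i - REST))


def goodness(a, w, decay):
    """G = H - S with H = a'Wa / 2 and S = sum_i d_i s_i (equal decays)."""
    n = len(a)
    harmony = 0.5 * sum(a[i] * w[i][j] * a[j] for i in range(n) for j in range(n))
    return harmony - decay * sum(stress(a_i) for a_i in a)


def settle(a, w, decay=0.1, lam=0.1, steps=500):
    """Iterate the update until (approximately) an equilibrium is reached."""
    for _ in range(steps):
        a = iac_step(a, w, decay, lam)
    return a


# Toy network: two units joined by one symmetric excitatory weight.
W = [[0.0, 1.0], [1.0, 0.0]]
A_START = [0.1, 0.1]
A_END = settle(A_START, W)
```

For this toy case the equilibrium condition (1 - a) a = 0.1 a gives a = 0.9 for both units, which is also where G is maximal along the symmetric direction.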


1.2  THE  RANDOM  COMPONENT 

Noise is injected into the deterministic skeleton of GRAIN by adding a random
vector $\nu$ to the net vector

$$net = a^T W + \nu \qquad (9)$$

As a default the random vector is made of independent identically distributed
zero mean Gaussian variables:

$$\nu \sim N(0, \sigma_t^2 I) \qquad (10)$$

where $\sigma_t^2$ stands for variance at time t. Formally, GRAIN is a Markovian
diffusion process that optimizes the goodness function subject to the
constraints imposed by the random vector.

1.2.1  Stochastic  stability 

In the same way that we study whether the activations stabilize for the
deterministic skeleton, we may investigate in the stochastic case whether the
probability distribution of activation states stabilizes over time and whether
these stable points depend on the starting conditions. For simplicity we
analyze this aspect by discretizing time and partitioning the activation states
of each unit into m states over the min-max interval. Let $p_{ij}(t)$ be the
probability of entering state j of the $m^n$ possible states at the t-th
transition given that the initial state is i. It is easy to show that GRAIN is
a regular Markov process, and thus, by applying the Markovian basic limit
theorem (Taylor & Karlin, 1984) it follows that:

(1) there exists a limiting distribution

$$P_{ij} = \lim_{t \to \infty} p_{ij}(t) \qquad (11)$$

(2) the distribution is unique and independent of the starting conditions

$$P_{ij} = P_j ; \quad i = 1 \ldots m^n \qquad (12)$$

(3) this limiting probability distribution equals the long run proportion of
time that the process will be in each of the states.

This last aspect, which is known as the ergodic property, provides great
flexibility for employing different strategies to estimate limiting
distribution statistics of GRAIN networks.
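The three consequences of the basic limit theorem can be checked on a toy regular Markov chain. The sketch below uses a hypothetical two-state transition matrix (it is not a GRAIN network): the distribution reached by repeated transitions is the same from either starting state, and it matches the long run proportion of time a single simulated trajectory spends in each state.

```python
import random


def step_distribution(dist, transition):
    """Propagate a row distribution through one transition of the chain."""
    n = len(transition)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]


def limiting_distribution(start, transition, steps=200):
    """Approximate the limiting distribution by repeated transitions."""
    dist = list(start)
    for _ in range(steps):
        dist = step_distribution(dist, transition)
    return dist


def long_run_proportions(transition, start_state, steps, seed=0):
    """Fraction of time one simulated trajectory spends in each state."""
    rng = random.Random(seed)
    states = range(len(transition))
    counts = [0] * len(transition)
    state = start_state
    for _ in range(steps):
        state = rng.choices(states, weights=transition[state])[0]
        counts[state] += 1
    return [c / steps for c in counts]


# Hypothetical regular chain; its stationary distribution is (5/6, 1/6).
TRANSITION = [[0.9, 0.1], [0.5, 0.5]]
FROM_STATE_0 = limiting_distribution([1.0, 0.0], TRANSITION)
FROM_STATE_1 = limiting_distribution([0.0, 1.0], TRANSITION)
TIME_AVERAGE = long_run_proportions(TRANSITION, start_state=1, steps=200000)
```

The agreement between the ensemble view (`FROM_STATE_0`, `FROM_STATE_1`) and the time average is what permits estimating equilibrium statistics either from one long run or from several shorter ones.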

2  TRAINING: THE CONTRASTIVE THEOREM

Training is based on the contrastive theorem, which was analyzed in a previous
paper (Movellan, 1990) and which we state below.

2.1  Contrastive  theorem 

Theorem: Let L(P) be a function of the $m^n$ dimensional vector
$P = \{P_1 \ldots P_{m^n}\}$. Consider the following two cases where L is optimized
as a function of P: 1) a (-) case where all the $m^n$ P dimensions are free, and
2) a (+) case where some constraints have been imposed (e.g., some of the
values of P are not allowed as solutions). Let $P^{(-)}$ be a value of P that
achieves $L^{(-)}$, a maximum of L in the free case. Let $P^{(+)}$ be a value of P
that achieves $L^{(+)}$, a maximum of L for the constrained case. Define the
contrastive function $CF = L^{(-)} - L^{(+)}$. IF L has a unique maximum on the
(-) case THEN it follows that: 1) $CF \geq 0$, and 2) $CF = 0$ if and only if
$P^{(+)} = P^{(-)}$.


Proof: The set of P points accessible in the constrained (+) case is a subset
of the set of points accessible in the free (-) case. Since $L^{(-)}$, the
maximum in the (-) case, is unique, any other L value must be smaller. That
is, if $P^{(+)} \neq P^{(-)}$ then $L^{(+)} < L^{(-)}$ and $CF > 0$. The only case
where $CF = 0$ is when $P^{(+)} = P^{(-)}$.

We will use the contrastive theorem to develop the learning algorithm. P
will represent the probability distribution of the $m^n$ network states. In most
cases we are interested in learning probability distributions conditional on
input unit patterns. Using the same idea as in the Boltzmann machine, in the
(-) case we will clamp the input units and let the rest of the network (hidden
and output units) run free. In the (+) case we will further constrain the
network by also clamping the output units to the environment. The learning
algorithm is based on minimizing the contrastive function by modifying the
weight and the decay parameters. If we get the contrastive function to be zero,
then the probability distribution of the free and clamped cases must be the
same and thus the network has learned.
2.2  Contrastive learning

Learning is defined as reproducing the probability distribution of the
environment conditional on the input. We conjecture that the equilibrium
distribution of GRAIN is maximizing expected goodness subject to a noise
constraint. This can be expressed in a Lagrange form as


$$L(W, P) = \langle G(W, P) \rangle - \beta N(P) \qquad (13)$$

where $P = \{P_1 \ldots P_{m^n}\}$ is a probability distribution over the $m^n$ network
states, $\langle \cdot \rangle$ stands for expected value, N(P) is a function expressing the
noise constraint, and $\beta$ is a Lagrange multiplier. The basic limit Markov
theorem guarantees that the equilibrium distribution is unique. Thus, if our
conjecture is true this distribution is the unique global maximum of L and we
can apply the contrastive theorem on the L function. Define (+) as the case
where inputs and outputs are clamped and (-) as the case where only inputs are
clamped. Define the contrastive function as

$$CF = L^{(-)} - L^{(+)} \qquad (14)$$

Applying the contrastive theorem it follows that CF is always $\geq 0$ and if
CF = 0 the limiting distribution is the same in the (+) case, where inputs and
outputs are clamped, as in the (-) case, where only inputs are clamped. Thus if
CF = 0 we have successfully learned the desired probability distributions.

We can perform gradient descent on the contrastive function with respect to
weights and with respect to decays. In order to do so we need the derivatives
of the equilibrium point of L. Using the chain rule we may decompose these
derivatives in two components*

$$\frac{dL}{dw_{ij}} = \frac{\partial L}{\partial w_{ij}} + \sum_k \frac{\partial L}{\partial p_k} \frac{\partial p_k}{\partial w_{ij}} \qquad (15)$$

It is easy to show that the first part of equation 15 is $\langle a_i a_j \rangle$. To first
order the second part of equation 15 vanishes because we are at a stable point
of L, where $\partial L / \partial p_k = 0$ for all k.* Therefore, to first order


$$\frac{\partial CF}{\partial w_{ij}} = \langle a_i a_j \rangle^{(-)} - \langle a_i a_j \rangle^{(+)} \qquad (16)$$

from which the rule for training weights is obtained. We will refer to this rule
as contrastive Hebbian learning.

*A weight can influence the value of the L maximum by changing the equilibrium
probability distribution of states and also by changing the goodness value of
each state. The left side of the equation indicates the total derivative where
both these influences are taken into consideration. The right side expresses
these separate influences as two additive factors.

*For a similar version of this argument see (Hinton, 1989), where it is applied
to the deterministic Boltzmann machine.
Similarly, to first order the derivatives for the decay terms are

$$\frac{\partial CF}{\partial d_i} = \langle s_i \rangle^{(+)} - \langle s_i \rangle^{(-)} \qquad (17)$$


Moving in directions opposite to the derivatives we obtain the appropriate
learning rules.
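As a minimal sketch of the weight rule, assume that activation vectors have already been sampled at equilibrium in the clamped (+) and free (-) phases. Moving opposite to the gradient of the contrastive function then amounts to adding the clamped co-activation statistics and subtracting the free ones. The function names and the `step_size` argument are illustrative, and the half-rate adjustment for self connections mentioned in the appendix is omitted.

```python
def coactivation(samples):
    """Average product <a_i a_j> over a list of sampled activation vectors."""
    n = len(samples[0])
    count = len(samples)
    return [[sum(a[i] * a[j] for a in samples) / count for j in range(n)]
            for i in range(n)]


def chl_weight_update(w, clamped_samples, free_samples, step_size):
    """Return new weights: w_ij + step_size * (<a_i a_j>(+) - <a_i a_j>(-))."""
    plus = coactivation(clamped_samples)
    minus = coactivation(free_samples)
    n = len(w)
    return [[w[i][j] + step_size * (plus[i][j] - minus[i][j]) for j in range(n)]
            for i in range(n)]


# Toy check: the teacher wants both units on together, the free network
# currently puts them in opposition, so the shared weight should grow.
W_OLD = [[0.0, 0.0], [0.0, 0.0]]
W_NEW = chl_weight_update(W_OLD, [[1.0, 1.0]], [[1.0, -1.0]], step_size=0.25)
```

When the two phases produce the same statistics the update is zero, which is the fixed point the contrastive theorem associates with a learned distribution. The decay rule has the same shape, with the stress terms in place of the co-activations and the opposite sign.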


2.3  Sampling  strategies 

Since the learning algorithm needs equilibrium distribution statistics, it is
important to get these statistics in a fast and accurate way. One approach is
to use annealing schedules by starting the settling process with a large noise
component and gradually diminishing it. Another approach is to use sharpening
or mean field annealing where initially large decay values are slowly replaced
by smaller ones. Combinations of sharpening and annealing are also possible.
Both sharpening and annealing schedules are oriented to sampling activation
statistics when the network is visiting the attractor with the biggest goodness
value. Since the statistics of this attractor tend to dominate the desired
equilibrium distribution, these procedures save time and in many cases provide
accurate statistics. Unfortunately, these procedures may also run into problems
when the network has to learn probability distributions where there is more
than one equally desirable pattern of activation for the same input. In this
case each of the desired patterns will have a corresponding maximum with the
same goodness value. Since annealing schedules are designed to visit only one
of the maxima at a time, the obtained statistics will be biased and will lead
to instabilities in the learning process. In such cases, we have found it
beneficial to let the network visit several large attractors by settling
several times with different random starting values before changing the
weights. Since the network is ergodic, equilibrium statistics using one or many
restarts converge, but in practice we have found that they are obtained faster
with the multiple restarts method. In our simulations we use the multiple
restarts technique and do not use annealing or sharpening.
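The multiple restarts strategy can be sketched as follows. The sketch assumes the IAC update of equations 2 and 3 with Gaussian noise added to the net input, asynchronous updating (one randomly chosen unit at a time, with a cycle being as many updates as there are units), the activation clipping described in the appendix, and max = 1, min = -1, rest = 0. Default parameter values follow the XOR simulation; the function names and the dictionary used for clamped units are illustrative.

```python
import random

MAX_ACT, MIN_ACT, REST = 1.0, -1.0, 0.0


def noisy_async_cycle(a, w, clamped, decay, lam, sigma, rng):
    """One cycle of noisy asynchronous updates; modifies a in place."""
    n = len(a)
    upper = MAX_ACT - lam * decay   # clipping bounds from the appendix
    lower = MIN_ACT + lam * decay
    for _ in range(n):
        i = rng.randrange(n)
        if i in clamped:
            continue                # clamped units keep their imposed value
        net = sum(a[j] * w[j][i] for j in range(n)) + rng.gauss(0.0, sigma)
        if net > REST:
            delta = (MAX_ACT - a[i]) * net - decay * (a[i] - REST)
        else:
            delta = (a[i] - MIN_ACT) * net - decay * (a[i] - REST)
        a[i] = min(upper, max(lower, a[i] + lam * delta))


def sample_with_restarts(w, clamped, restarts, burn_in, sample_cycles,
                         decay=0.1, lam=0.1, sigma=1.0, seed=0):
    """Collect activation samples over several randomly initialized settlings."""
    rng = random.Random(seed)
    n = len(w)
    samples = []
    for _ in range(restarts):
        # Restart from uniformly distributed activations in the allowed range.
        a = [rng.uniform(MIN_ACT, MAX_ACT) for _ in range(n)]
        for unit, value in clamped.items():
            a[unit] = value
        for _ in range(burn_in):        # statistics not collected
            noisy_async_cycle(a, w, clamped, decay, lam, sigma, rng)
        for _ in range(sample_cycles):  # statistics collected
            noisy_async_cycle(a, w, clamped, decay, lam, sigma, rng)
            samples.append(list(a))
    return samples


# Toy usage: two units, unit 0 clamped to 0.9, three short settlings.
SAMPLES = sample_with_restarts([[0.0, 1.0], [1.0, 0.0]], {0: 0.9},
                               restarts=3, burn_in=5, sample_cycles=4)
```

Pooling the samples across restarts, rather than relying on a single long settling, is what lets the statistics reflect several attractors of equal goodness.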


3  RELATION  TO  OTHER  MODELS 

GRAIN is currently implemented as a continuous Hopfield network with an
added random component. As such, GRAIN is a Markovian diffusion process
similar to that used by Ratcliff (1978) to model reaction time distributions in
human cognition. GRAIN is also a variant of the Gaussian machine developed
by Akiyama et al. to solve optimization problems, but contrary to Ratcliff's
model or to the Gaussian machine, GRAIN can be trained. The name contrastive
Hebbian learning was inspired by Galland and Hinton's reference to the
contrastive Hebbian synapse in the deterministic Boltzmann machine (Galland
and Hinton, 1989). The contrastive Hebbian learning algorithm for weights was
first applied by Hopfield to his discrete model but no theoretical ground was
offered at the time (Hopfield et al., 1983). A mathematical foundation based on
statistical mechanics was given with the discrete stochastic Boltzmann machine
(Ackley et al., 1985). The algorithm was also derived from statistical mechanics
principles for the continuous deterministic case (Peterson and Anderson, 1987;
Peterson and Hartman, 1989; Hinton, 1989). Movellan (1990) noticed that CHL
could be generalized, without reference to statistical mechanics, to any
deterministic continuous Hopfield network by using the contrastive theorem. In
this paper we generalize the learning algorithm for the continuous stochastic
case and propose a rule for training the decay parameters. GRAIN encompasses a
large set of models as special cases. By turning decays to zero, lambdas to 1,
and noise to zero GRAIN becomes the schema model (Rumelhart et al., 1986).
By adding decay terms, GRAIN implements the update equations of the
interactive activation and competition (IAC) model (McClelland and Rumelhart,
1981). Both the schema and the IAC models are versions of the continuous
Hopfield model (Hopfield, 1984). If we add Gaussian noise we get a version of
the Gaussian machine (Akiyama et al., 1989). If we use a logistic update
function with large gain, lambdas turned to 1, and asynchronous activation
update, the network approximates the discrete Hopfield model (Hopfield, 1982).
If we then add Gaussian noise, the network approximates a Boltzmann machine
(Ackley et al., 1985). In all cases the networks can be trained with the same
contrastive learning algorithm.

4  SIMULATIONS 

The purpose of the simulations is to investigate whether CHL can be used to
train GRAIN in some standard problems as well as problems outside the scope
of our current connectionist learning algorithms. In particular we show the
results of the following 4 problems: 1) Standard XOR, 2) Translation problem,
3) Learning independent continuous probability distributions, 4) Learning
continuous probability distributions governed by XOR. In all simulations the
IAC update function was used with max = 1.0, min = -1.0, rest = 0.0. The
weights were symmetric, and decays were maintained constant and equal for all
units. Learning was done in epoch mode, letting the network settle several
times per pattern with different random starting values and accumulating the
statistics for all the patterns before changing the weights. No annealing or
sharpening schedules were used.

4.0.1  Standard  XOR 

The purpose of this simulation was to investigate whether GRAIN could be
trained to solve a problem requiring hidden units. We performed 25 learning
simulations with different random starting weights. The network consisted of
three layers (2 input units, 4 hidden units, 1 output unit) with connections
only between adjacent layers. Within each layer the units were interconnected
with self connections fixed to zero. Initial weights were sampled from a
(-1,1) uniform distribution. Learning was done in epoch mode with 10 settling
restarts per pattern. Each settling consisted of 100 initial cycles of
asynchronous activation update* where statistics were not collected, and 100
more cycles where statistics were collected. Decays were set at 0.1, lambdas
at 0.1, and the Gaussian noise standard deviations were set at 1.0 for each
unit. The stepsize constant for weight adjustment was set at 0.25. A 0.01
weight decay constant was also applied after each learning epoch. The median
number of learning epochs until the information gain error measure was smaller
than 0.04 was 26. This is a very strict learning criterion*. We then turned the
network to (-1, +1) Boltzmann mode by setting the activations to 1 if the net
input combined with noise was greater than 1 or to -1 otherwise. In Boltzmann
mode the median number of epochs was 65. The difference was statistically
significant (Wilcoxon, p < 0.001). We have tried a number of variations in the
parameters and in all cases using graded activations was beneficial to speed up
learning. The difference was less striking when less stringent learning
criteria were used.

4.0.2  Translation  problem 

The purpose of the simulation was to investigate whether GRAIN can learn
probabilistic mappings using distributed representations. The inspiration for
this simulation was a problem that arises when training networks with
distributed representations to do translations from one language to another. In
those cases where a word has two or more acceptable translations,
backpropagation converges to the expected values of each teacher unit, producing
unacceptable blends of distributed representations. In our simulation "words"
were encoded as random binary patterns distributed amongst 8 "English" and
8 "Spanish" units. There were 24 additional hidden units and all 24+8+8 units
were fully interconnected. Sometimes Spanish units were clamped to get a
correct translation in the English module, and sometimes the English units were
clamped to get a translation in the Spanish module (see Table 1). Initial
weights were sampled from a (-1,1) uniform distribution. Each settling
consisted of 500 initial cycles of synchronous activation update* where
statistics were not collected, and 200 more cycles where statistics were
collected. Decays were set at 0.1, lambdas at 0.1, and the Gaussian noise
standard deviations were set at 1.0 for each unit. The stepsize constant for
weight adjustment was set at 0.01 for the first 200 epochs, 0.005 for the next
200 epochs and 0.001 for the last 200 epochs. Learning was done in epoch mode
with 10 settling restarts per pattern. The results with GRAIN after 600
learning epochs are in Table 1, which shows that a good approximation to the
desired probabilities was obtained. Most importantly, for the ambiguous words,
where more than one translation is possible, the network was nearly always in
one of the correct alternatives and did not generate unacceptable blends. In
principle this problem is also solvable in Boltzmann mode but we have not
examined this case.

*In asynchronous mode one randomly chosen unit is updated at a time. A cycle is
defined as performing as many random updates as the number of units in the
network.

*For an explanation of the learning criterion see the appendix.

*In synchronous mode all the units in the network are updated at the same time.


Input        Translation                 Translation

[missing]    casa      1.000 [0.998]
home         [missing]
tomorrow     mañana    1.000 [0.898]
later        mañana    1.000 [0.698]
olive        aceituna  0.500 [0.399]     oliva  0.500 [0.595]
be           ser       0.500 [0.700]     estar  0.500 [0.298]
casa         house     0.500 [0.367]     home   0.500 [0.582]
[missing]    tomorrow  0.500 [0.499]     later  0.500 [0.398]
aceituna     olive     1.000 [0.992]
oliva        olive     1.000 [missing]
ser          be        1.000 [0.900]
estar        [missing]

Table 1: Translation problem: Column one shows the input pattern and columns
2 and 3 the possible translations. The two numbers for each translation
represent the desired probability and, in brackets, the obtained probability of
the translations after 600 learning epochs. A pattern was considered correct if
each output unit activation was within a 0.4 range of the desired value (-0.9
or +0.9).


4.0.3  Learning  independent  continuous  probability  distributions 

This problem cannot be learned with backpropagation networks or with
Boltzmann machines. The GRAIN network consisted of 5 output units connected
to 10 fully interconnected hidden units. Self connections were allowed. Initial
weights were sampled from a (-1,1) uniform distribution. Each output unit was
trained to reproduce a continuous probability distribution according to Table 2.
Learning was done in epoch mode with 10 settling restarts per pattern. Each
settling consisted of 20 initial cycles of asynchronous activation update where
statistics were not collected, and 500 more cycles where statistics were
collected. Decays were set at 0.1, lambdas at 0.1, and the Gaussian noise
standard deviations were set at 1.5 for each unit. The stepsize constant for
weight adjustment was set at 0.005.

Figure 1 shows 10,000 activation cycles of the 5 output units after 1000
learning epochs. Figure 2 shows the probability density functions of the first
three and the last two output units throughout learning. It can be seen that
the output distributions successfully approximate the desired distributions
given the constraints imposed by the injected noise. The pairwise correlations
of the output unit activations after training were zero to the third decimal
place.

Output unit    Distribution    Expected Value

1              Binomial         0
2              Constant         0
3              Uniform          0
4              Constant        -0.5
5              [missing]       -0.5

Table 2: Desired probability distributions for each of the 5 output units

4.1  Learning continuous probability distributions governed by XOR

This is a problem that cannot be learned with Boltzmann machines or
backpropagation networks and that necessitates hidden units. There were 2 input
units, 1 output unit, and 10 hidden units. Initial weights were sampled from
a (-1,1) uniform distribution. The network was fully interconnected, with self
connections allowed. The probability distribution to be learned by the output
unit depended on the input conditions as indicated in Table 3.


Input Units    Distribution    Expected Value

-1  -1         Constant        0
-1   1         Binomial        0
 1  -1         Binomial        0
 1   1         Constant        0

Table 3: Desired output probability distributions as a function of the input
patterns.


Learning was done in epoch mode with 4 settling restarts per pattern. Each
settling consisted of 20 initial cycles of asynchronous activation update where
statistics were not collected, and 500 more cycles where statistics were
collected. Decays were set at 0.1, lambdas at 0.1, and the Gaussian noise
standard deviations were set at 1.0 for each unit. The stepsize constant for
weight adjustment was set at 0.001. Figure 3 shows the results after 600
learning epochs. It can be seen that the probability distribution of the
activations conditional on the four input patterns successfully approximates
the desired distributions given the constraints imposed by noise.

5  CONCLUSION 

Backpropagation networks can only learn central tendencies of independent
output units, not probability distributions over patterns of output units.
Stochastic Boltzmann machines can in principle learn probability distributions
but are limited to discrete binary values. GRAIN extends the scope of previous
connectionist learning algorithms and encompasses a large category of them
under the same theoretical framework. Our simulations show that GRAIN networks
can be successfully trained to approximate probability distributions
conditional on the input patterns.

We hope this work contributes to a renewal of interest in going beyond the
learning of central tendencies to the modeling of distributions. Considerable
work needs to be done. Analytically, it remains to be shown rather than just
conjectured that GRAIN is maximizing a function of the form proposed in
equation 13. Secondly, it would be helpful to link minimization of the
contrastive function with minimization of other error functions. The
contrastive theorem guarantees that when the contrastive function is zero we
have learned to match the environmentally specified probability distributions
exactly. Unfortunately the theorem does not say whether the relationship
between the contrastive function and error functions such as information gain
is monotonic. In practice we have found this to be the case, with both
information gain and the contrastive function decreasing as learning
progresses, but we have not yet established this link analytically.
6  Appendix: Implementational details


- We have found it beneficial when using the update function of equations 2 and
3 to clip the activations to values smaller than max and bigger than min. This
is done to avoid spurious oscillations that occur when the activations are close
to extreme values when approximating a continuous time process with discrete
time steps. In our simulations we did not let the activations grow bigger than
max - lambda * decay or smaller than min + lambda * decay.

- We have also found it beneficial to use non-extreme teacher values. For
instance for the XOR and translation problems the teachers were set to either
-0.9 or 0.9 instead of -1.0 or 1.0. For Boltzmann mode the teachers were
(-1, +1).

- As discussed in Movellan (1990), gradient descent calls for the self
connections to be changed at half the rate of the other weights. Our
simulations followed this rule.

- The random restarts for each settling were done by reinitializing the
activations to uniformly distributed random values within the allowable
activation range.

- The error criterion to stop the training process is based on the following
information gain function:

$$\sum_{patterns} \int_{states} P_{desired} \log \frac{P_{desired}}{P_{obtained}} \qquad (18)$$


where P stands for probability density. This measure is the continuous version
of the discrete information gain function used in discrete Boltzmann machines.
In practice, we approximate the integral by defining a region surrounding each
of the desired distributed states and assessing the actual probability that the
activations fall within that region. For the XOR a unitwise tolerance of 0.4 was
used. With targets set at ±0.9 the activation of the output unit had to be above
0.5 for +0.9 teachers and below -0.5 for -0.9 teachers to fall within the target
region. For the XOR simulation, a value of information gain of 0.04 corresponds
quite closely to an average probability of 0.99 that the network is in the
target state across the 10 restarts x 100 cycles x 4 patterns.
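The correspondence between an information gain of 0.04 and a 0.99 probability of being in the target region can be checked with a small sketch. It assumes the region-based approximation described above, one target region per XOR pattern with desired probability 1, and natural logarithms (the base is not stated in the text); the function name is illustrative.

```python
import math


def information_gain(desired, obtained):
    """Sum over patterns and target regions of P_des * log(P_des / P_obt).

    desired and obtained are lists (one entry per input pattern) of lists of
    region probabilities. Regions with zero desired probability contribute
    nothing to the sum.
    """
    total = 0.0
    for p_des, p_obt in zip(desired, obtained):
        for d, o in zip(p_des, p_obt):
            if d > 0.0:
                total += d * math.log(d / o)
    return total


# XOR: four input patterns, each with a single target region.
XOR_DESIRED = [[1.0], [1.0], [1.0], [1.0]]
XOR_OBTAINED = [[0.99], [0.99], [0.99], [0.99]]
XOR_GAIN = information_gain(XOR_DESIRED, XOR_OBTAINED)
```

Under these assumptions the four patterns each contribute log(1/0.99), about 0.01, for a total close to 0.04, in line with the stopping criterion used in the XOR simulation.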
Figure 1: Output unit activations after 300 learning epochs. Each graph
represents the activation of one output unit throughout 10,000 cycles. (Axis
label on each of the five graphs: activation.)


Figure 2: Probability distribution of the first three output units (top line) and
the last two output units (bottom line) throughout learning.


